This project was done as a part of “Data Analysis” course as part of our teams studies in Digital Sciences for High-Tech in the University of Tel-Aviv. Our team has great interest in using our studies for exploring and in the future maybe even developing tools for improving the way we treat our planet. Therefore our subject is CO2 emissions in congestion with the world happiness index.
happy_index_2005 <- "Data/world-happiness-report-2005-2020.csv"
happy_index_2021 <- "Data/world-happiness-report-2021.csv"
emission <- "Data/co2_emission.csv"
population <-"Data/population.csv"
df_2005 <- read.csv(happy_index_2005)
df_2021 <- read.csv(happy_index_2021)
df_emiss <- read.csv(emission, col.names = c("Entity", "Code" , "Year", "co2"))
df_pop <- read.csv(population, col.names = c("Entity", "Code" , "Year", "Population"))
The first Data consists of the UN happiness index for the years 2005-2017.
as_tibble(df_2005)
The second data is the happiness index report for 2021.
as_tibble(df_2021)
The third data contains CO2 emissions from around 1700 up until 2017, by countries and continents.
as_tibble(df_emiss)
The final data is a list of population sizes by countries for about the same years as the CO2 data.
as_tibble(df_pop)
As seen in the table above, the main dataset regarding emissions has only 4 features, of which 2 are identical (countries and country code). We will now examine the summary of it’s characteristics:
summary(df_emiss)
Entity Code Year co2
Length:20853 Length:20853 Min. :1751 Min. :-6.255e+08
Class :character Class :character 1st Qu.:1932 1st Qu.: 3.188e+05
Mode :character Mode :character Median :1971 Median : 3.829e+06
Mean :1953 Mean : 1.931e+08
3rd Qu.:1995 3rd Qu.: 3.707e+07
Max. :2017 Max. : 3.615e+10
As seen above, we can see the minimal year, minimal amount of CO2 emissions and same for mean and max values. We also learn about the size of the data, around ~21k rows.
[1] "ï..Country.name" "year" "Life.Ladder"
[4] "Log.GDP.per.capita" "Social.support" "Healthy.life.expectancy.at.birth"
[7] "Freedom.to.make.life.choices" "Generosity" "Perceptions.of.corruption"
[10] "Positive.affect" "Negative.affect"
Joining, by = "Entity"
We have just created the two main datasets to work with, which includes important features from all of the previous datasets and some features that we have created above:
We also made a choice to limit our data regarding emissions to after 1950. The reason is that the rise in values is almost exponential after the years of WW2. When plotted on a graph, placing values from before and after 1950 they become incomparable.
The summary for df_merge:
summary(df_merge)
Entity avg.ratio avg.Population avg.co2 Life.Ladder sum.dif
Length:138 Min. : 0.03123 Min. :2.900e+05 Min. :2.539e+05 Min. :2.662 Min. :-675355128
Class :character 1st Qu.: 0.67361 1st Qu.:4.486e+06 1st Qu.:4.375e+06 1st Qu.:4.593 1st Qu.: 142380
Mode :character Median : 3.04396 Median :9.762e+06 Median :2.017e+07 Median :5.587 Median : 6399728
Mean : 4.87430 Mean :4.360e+07 Mean :1.902e+08 Mean :5.463 Mean : 102976353
3rd Qu.: 7.70521 3rd Qu.:2.873e+07 3rd Qu.:9.600e+07 3rd Qu.:6.262 3rd Qu.: 28576960
Max. :25.67181 Max. :1.296e+09 Max. :5.558e+09 Max. :7.788 Max. :7786512016
And for df_all:
summary(df_all)
Entity Code Year co2 Diff Population ratio
Length:17854 Length:17854 Min. :1950 Min. :-6.255e+08 Min. :-600971748 Min. :1.000e+03 Min. : 0.000
Class :character Class :character 1st Qu.:1967 1st Qu.: 4.746e+05 1st Qu.: -9810 1st Qu.:2.400e+05 1st Qu.: 0.408
Mode :character Mode :character Median :1984 Median : 4.430e+06 Median : 43968 Median :3.386e+06 Median : 1.841
Mean :1984 Mean : 2.423e+08 Mean : 5310082 Mean :6.108e+07 Mean : 5.211
3rd Qu.:2002 3rd Qu.: 4.593e+07 3rd Qu.: 1018269 3rd Qu.:1.230e+07 3rd Qu.: 6.384
Max. :2019 Max. : 3.615e+10 Max. :1543508239 Max. :7.713e+09 Max. :403.351
NA's :3595 NA's :3595 NA's :914 NA's :4509
skim(df_merge) # TODO Mayber remove this ? seems to much
-- Data Summary ------------------------
Values
Name df_merge
Number of rows 138
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None
-- Variable type: character ------------------------------------------------------------------------------------------------------------------
# A tibble: 1 x 8
skim_variable n_missing complete_rate min max empty n_unique whitespace
* <chr> <int> <dbl> <int> <int> <int> <int> <int>
1 Entity 0 1 4 24 0 138 0
-- Variable type: numeric --------------------------------------------------------------------------------------------------------------------
# A tibble: 5 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 avg.ratio 0 1 4.87 5.71 3.12e-2 0.674 3.04 7.71 25.7 ▇▃▁▁▁
2 avg.Population 0 1 43601578. 146848485. 2.90e+5 4485573. 9761806. 28733782. 1295924678. ▇▁▁▁▁
3 avg.co2 0 1 190209421. 689048147. 2.54e+5 4375076. 20167978. 95996205. 5558492798. ▇▁▁▁▁
4 Life.Ladder 0 1 5.46 1.16 2.66e+0 4.59 5.59 6.26 7.79 ▂▆▇▇▅
5 sum.dif 0 1 102976353. 694705139. -6.75e+8 142380. 6399728. 28576960. 7786512016 ▇▁▁▁▁
In this plot the different countries in the data are shown in deeper colors for higher values of emission to population ratio in the year of 2017
d = df_all %>%
filter(Year==2017)
l <- list(color = toRGB("grey"), width = 0.2)
# specify map projection/options
g <- list(
showframe = FALSE,
showcoastlines = FALSE,
projection = list(type = 'Mercator')
)
p <- plot_geo(d) %>%
add_trace(
z = ~ratio, color = ~ratio, colors = 'Reds',
text = ~Entity, locations = ~Code, marker = list(line = l)
) %>%
colorbar(title = 'CO2/Population', thickness=15) %>%
layout(
title = 'World Ratio of CO2/Population',
geo = g,
autosize = F
)
p <- ggplotly(p, width = 3000, height = 1500, automargin = TRUE)
p
The following plot shows for the years 2000-2017 the amount of CO2 emitted from each country vs it’s population. The ratio is shown as well as the size of each dot. For comfort of the view, each country is allocated to it’s fitting continent and a color.
fig <- d %>%
plot_ly(
x = ~co2,
y = ~Population,
size = ~ratio,
frame = ~Year,
text = ~Entity,
color = ~continent,
type = 'scatter',
mode = 'markers',
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "Yearly CO2 Emissions by Countries vs Population",
xaxis = list(
type = "log"
# autosize = F
)
)
fig
We would like to develop this graph forward and see it with correlation to the life ladder index from 2005-2017. For this purpose we would like to create another df, containing both happiness index scores and data from “df_all”, which contains continents data as well now:
We would receive the following plot:
fig <- plot_ly(
data = df_2005_2017,
x = ~co2,
y = ~Life.Ladder,
size = ~ratio,
frame = ~Year,
text = ~Entity,
color = ~continent,
type = 'scatter',
mode = 'markers',
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "Yearly CO2 Emissions by Countries vs Life Ladder",
xaxis = list(
type = "log"
)
)
fig
And yet another scatter plot for different countries:
df_3d <- df_2005_2017 %>%
filter(Year == 2017)
fig <- df_3d %>%
plot_ly(
x = ~Population,
y = ~ratio,
z = ~Life.Ladder,
# size = ~Life.Ladder,
# frame = ~Entity,
text = ~Entity,
color = ~continent,
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "CO2 Emissions vs Life Ladder, Filtered by Countries, 2017"
)
fig <- fig %>% add_markers()
fig
# d <- df_all[ , c("Entity", "Year", "Population")]
d <- df_all %>%
filter(Entity != "World") %>%
filter(co2 != 0)
d
pp <- streamgraph(d, key="Entity",
order = "asis",
value="co2",
date="Year",
offset="zero",
sort="co2"
) %>%
sg_axis_y(tick_format = "e") %>%
sg_legend(show=TRUE, label="Country: ") %>%
sg_title("CO2 Emissions by Years and Countries, 1950-2017")
pp
streamgraph_html returned an object of class `list` instead of a `shiny.tag`.streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
pp <- streamgraph(df_2005, key = "Entity",
# order = "reverse",
value="Life.Ladder",
date="year"
# offset="zero",
) %>%
sg_fill_brewer("Blues") %>%
sg_legend(show=TRUE, label="Country: ") %>%
sg_title("Happiness Index by Years and Countries, 1950-2017")
pp
streamgraph_html returned an object of class `list` instead of a `shiny.tag`.streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
# g <- ggplot(data = df_2005, aes(x = year, y = Life.Ladder, group = Entity)) +
# geom_line() +
# geom_line(color="#69b3a2") +
# theme_ipsum()
#
# g <- g + labs(title = "Contries Hapinness by Years")
# g <- ggplotly(g)
# g
top_emiss_2017 <- df_emiss %>%
filter(Year == 2017) %>%
filter(Code != '') %>%
filter(Entity != "World") %>%
top_n( 10, co2) %>%
arrange(desc(co2)) %>%
head(10)
top_countries = top_emiss_2017$Entity
g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_countries, ]
, aes(x = Year, y = co2, group = Entity)) +
geom_line() +
labs(title = "Top 10 Countries by CO2 Emissions") +
geom_line(aes(col = Entity)) +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
g <- ggplot(data = df_all[df_all$Entity %in% top_countries, ]
, aes(x = Year, y = ratio, group = Entity)) +
geom_line(aes(col = Entity)) +
labs(title = "Top 10 Countries by CO2 Emissions to Population ratio") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
NA
d <- df_all[df_all$Entity %in% top_countries, ]
fig <- plot_ly(d,
x = ~Entity,
y = ~Diff,
type = "box",
color = ~Year,
height = 500,
width = 900,
automargin = TRUE)
fig <- fig %>% layout(
title = "Difference in Emission for Top 10 Polluting Countries",
boxmode = "group")
fig
NA
top_10_happiness <- df_merge %>%
filter(rank(desc(Life.Ladder))<=10)
g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_10_happiness$Entity, ]
, aes(x = Year, y = co2, group = Entity)) +
geom_line() +
labs(title = "Top 10 happiness Countries by CO2 Emissions") +
theme_ipsum() +
geom_line(aes(col = Entity))
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
In this section we are going to suggest two types of correlation based on our exploration of the data:
g <- ggplot(data = df_merge, aes(x = avg.ratio, y = Life.Ladder)) +
geom_point(aes(colour = Entity)) +
labs(title = "Average Ratio vs Happiness Index") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
In the following plot we have removed china and India because they are sort of “outliers” in a sense that they have displayed much larger grow in emissions compared to other countries, therefore making it very hard to disply on the same plot.
df <- subset(df_merge, Entity != "China" & Entity != "India" )
g <- ggplot(data = df, aes(x = Life.Ladder , y = sum.dif)) +
geom_point(aes(colour = Entity)) +
labs(title = "Sum of Emissions Difference vs 2017 Happiness Index (China, India ex.)") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
# Part 4: Modeling
We are now going to introduce the linear model:
linearmod <- lm(Life.Ladder ~ avg.ratio, data=df_merge)
summary(linearmod)
Call:
lm(formula = Life.Ladder ~ avg.ratio, data = df_merge)
Residuals:
Min 1Q Median 3Q Max
-2.2447 -0.6262 -0.0959 0.6765 2.1682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.88716 0.10612 46.055 < 2e-16 ***
avg.ratio 0.11818 0.01416 8.346 7.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9468 on 136 degrees of freedom
Multiple R-squared: 0.3387, Adjusted R-squared: 0.3338
F-statistic: 69.65 on 1 and 136 DF, p-value: 7.093e-14
confint(linearmod, level=0.95)
2.5 % 97.5 %
(Intercept) 4.67731147 5.0970157
avg.ratio 0.09017982 0.1461869
p <- plot_ly(
x = fitted(linearmod),
y = residuals(linearmod),
height = 500,
width = 900,
automargin = TRUE)
p<- p %>% layout(
title = "Residual plot for the Linear Model",
autosize = T,
yaxis = list(title = 'Residuals'),
xaxis = list(title = 'Fitted Linear Model')
)
p
g <- ggplot(df_merge, aes(x = avg.ratio, y = Life.Ladder)) +
geom_point(aes(colour = Entity)) +
theme_bw() +
stat_smooth(method = "lm") +
labs(title = "Linear Modelling Happiness Index and Average Ratio")
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
`geom_smooth()` using formula 'y ~ x'
g
NA
NA
numeric_tidy <- df_merge[-1]
corr_data <- cor(numeric_tidy)
g <- ggcorrplot(corr_data, hc.order = TRUE, type = "lower",
outline.col = "white") +
labs(title = "Correlation Matrix") +
theme_ipsum()
# colors = c("darkolivegreen2", "white", "darkolivegreen"))
ax <- list(
title = "",
showgrid = FALSE)
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE) %>%
layout(xaxis=ax, yaxis=ax)
g
NA
NA
The following section presents another model and it’s summary:
gmod <- lm(Life.Ladder ~ log(avg.ratio), data=df_merge)
summary(gmod)
Call:
lm(formula = Life.Ladder ~ log(avg.ratio), data = df_merge)
Residuals:
Min 1Q Median 3Q Max
-2.03388 -0.55089 0.04812 0.59537 1.89653
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.15234 0.07550 68.25 <2e-16 ***
log(avg.ratio) 0.48724 0.04229 11.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8283 on 136 degrees of freedom
Multiple R-squared: 0.4939, Adjusted R-squared: 0.4902
F-statistic: 132.7 on 1 and 136 DF, p-value: < 2.2e-16
#plot x
p <- qplot(x = avg.ratio, Life.Ladder, data=df_merge)
p <- p + geom_smooth(method = "glm", formula = y~log(x), family = binomial(link = 'log')) +
geom_point(aes(colour=Entity)) +
theme_ipsum() +
labs(title = "Linear Regression Model - Average Ratio to Life Ladder Score")
p <- ggplotly(p, height = 500,
width = 900,
automargin = TRUE)
p
On this model, We can see a week correlation, but an existing one to the logistic regression. Interestingly enough, The “happiest” countries seem to have a higher ratio of CO2 emissions as well. Also worth mentioning from this plot: * The US is located far right on this map, meaning it’s emitting a lot of CO2 in proportion to it’s population, but it is pretty high on the happiness index as well. * Kuwait, being the leader of this unfortunate characteristic, also has a pretty decent Life Ladder score. *
#multiple regression
ex_list <- c("China", "India", "United States")
d <- df_merge[(!(df_merge$Entity %in% ex_list)),]
x <- d$avg.ratio + d$sum.dif + d$avg.co2
multi <- lm(Life.Ladder ~ x, data=d)
p <- qplot(x, Life.Ladder, data=d)
p <- p +
geom_smooth(method = "lm",
formula = y~x, family = binomial(link = 'log')) +
geom_point(aes(colour=Entity)) +
theme_ipsum() +
labs(title = "Linear Regression Model -Multiple Variables to Life Ladder Score")
Ignoring unknown parameters: family
fig <- ggplotly(p, height = 500,
width = 900,
automargin = TRUE)
fig
NA
Now let’s say we want to compare the differences between the top 10 happiness countries avg.ratio is higher then other countries. In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t-test to evaluate whether the means are different.
mA - Weighted average of the 10 most happiest countries. mB - Weighted average of the rest of the countries.
Our research questions:
Is the weighted average of the 10 happiest countries (mA) greater than the weighted average of the other countries (mB)?
H0:mA≥mB - The null hypothesis
Ha:mA<mB (less) - The alternative hypothesis
# Create a data frame
T_data <- df_merge %>%
select(Entity, avg.ratio,) %>%
mutate(group = ifelse(Entity %in% top_10_happiness$Entity, "Top 10 Contries", "Other Countries"))
## TODO Yarden - Need to add to T_data a column of Life.Ladder
## TODO make a graph that shows the difference of Life Ladder and avg.ratio by groups "top_10" and "others".
g <- ggboxplot(T_data, x = "group", y = "avg.ratio",
color = "group", palette = c("#00AFBB", "#E7B800"),
ylab = "avg. ratio", xlab = "Group") +
theme_ipsum() +
labs(title = "Average Ratio Compared - Top 10 Countries vs Other Countries")
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
NA
NA
#Do the two group have the same variances?
#We’ll use F-test to test for homogeneity in variances.
res.ftest <- var.test(avg.ratio ~ group, data = T_data)
res.ftest
F test to compare two variances
data: avg.ratio by group
F = 3.8362, num df = 127, denom df = 9, p-value = 0.03252
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.132105 8.499782
sample estimates:
ratio of variances
3.836239
The p-value of the F-test is p = 0.03252. It is lower than the significance level alpha = 0.05. In conclusion, there is a significant difference between the variances of the two data sets. Therefore, we used the T-test and assumed that the variances are not equal - according to case 1 of hypothesis testing.
# Compute t-test
res <- t.test(avg.ratio ~ group, data = T_data, var.equal = FALSE, alternative = "less")
res
Welch Two Sample t-test
data: avg.ratio by group
t = -5.1609, df = 15.107, p-value = 5.678e-05
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -3.574853
sample estimates:
mean in group Other Countries mean in group Top 10 Contries
4.482087 9.894609